Adding temperature scaling on Joiner logits: #789
- T hard-coded to 2.0
- so far best result NCE 0.122 (still not so high)
- the BPE scores were rescaled with 0.2 (but then incorrect words also get high confidence; visually reasonable histograms are for 0.5 scale)
- BPE->WORD score merging done by min(.) function (also tried prob-product, and arithmetic, geometric, harmonic mean)
- without temperature scaling (i.e. scale 1.0), the best NCE was 0.032 (here product merging was best)

Results seem consistent with: https://arxiv.org/abs/2110.15222
Everything tuned on a very small set of 100 sentences with 813 words and 10.2% WER, a Czech model.
I also experimented with blank posteriors mixed into the BPE confidences, but found no NCE improvement, so not pushing that.
Temperature scaling also added to the Greedy search confidences.
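The BPE->WORD merging variants listed above can be sketched as follows. This is only an illustration, not the code from this PR: the names `MergeMethod` and `MergeBpeScores` are hypothetical, and the sketch assumes each BPE score is a probability in (0, 1].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical names; the PR description only lists the merging variants.
enum class MergeMethod { kMin, kProduct, kArithmetic, kGeometric, kHarmonic };

// Merges the confidences of the BPE pieces of one word into a word confidence.
// Assumes every score is a probability in (0, 1].
float MergeBpeScores(const std::vector<float> &p, MergeMethod method) {
  assert(!p.empty());
  const float n = static_cast<float>(p.size());
  switch (method) {
    case MergeMethod::kMin:  // the variant reported as best with T = 2.0
      return *std::min_element(p.begin(), p.end());
    case MergeMethod::kProduct:  // reported as best without temperature
      return std::accumulate(p.begin(), p.end(), 1.0f,
                             std::multiplies<float>());
    case MergeMethod::kArithmetic:
      return std::accumulate(p.begin(), p.end(), 0.0f) / n;
    case MergeMethod::kGeometric: {
      float log_sum = 0.0f;
      for (float v : p) log_sum += std::log(v);
      return std::exp(log_sum / n);
    }
    case MergeMethod::kHarmonic: {
      float inv_sum = 0.0f;
      for (float v : p) inv_sum += 1.0f / v;
      return n / inv_sum;
    }
  }
  return 0.0f;  // not reached for valid enum values
}
```

With min(.), a word is only as confident as its least confident piece, whereas the product penalizes words that are split into many pieces.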
// copy raw logits, apply temperature-scaling (for confidences)
int32_t p_logit_items = vocab_size * num_hyps;
std::vector<float> logit_with_temperature(p_logit_items);
Can we change p_logit in-place?
No, it cannot be done in-place.
The idea is to apply the temperature only when computing the confidences;
the decoding continues to use the original values.
This is why the logit values are copied to a new buffer.
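A minimal sketch of that idea, assuming the usual convention of dividing the logits by T before the softmax. The helper name `ConfidencePosteriors` is hypothetical and handles a single hypothesis; it is not the code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: returns softmax(logits / temperature) for one hypothesis.
// The input is a const reference, so the logits used by the decoder stay intact.
std::vector<float> ConfidencePosteriors(const std::vector<float> &logits,
                                        float temperature) {
  assert(!logits.empty() && temperature > 0.0f);

  // Work on a copy: the temperature affects only the confidence computation.
  std::vector<float> scaled(logits);
  for (float &v : scaled) v /= temperature;

  // Numerically stable softmax over the scaled copy.
  const float max_v = *std::max_element(scaled.begin(), scaled.end());
  float sum = 0.0f;
  for (float &v : scaled) {
    v = std::exp(v - max_v);
    sum += v;
  }
  for (float &v : scaled) v /= sum;
  return scaled;
}
```

With T > 1 the distribution becomes flatter, so the top posterior drops, while the ranking of the tokens and hence the search itself is unaffected.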
Thanks for the explanation.
Thanks!
Could you make it configurable and give it a default value 1.0 (like what we are doing for blank penalty)?
Okay, working on it. I am not sure about the default of 1.0: with 1.0 the confidences have worse quality than with 2.0.
1.0 is for backward compatibility.
In that case, 2.0 is fine with me.
Okay, the T parameter is now configurable.
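As a rough illustration of a configurable T with the default agreed on above: only the name `temperature_scale` appears in this PR, while the struct `ConfidenceConfig` and the helper `ScaleLogitForConfidence` are hypothetical, and division by T is an assumption.

```cpp
#include <cassert>

// Hypothetical config struct; only `temperature_scale` is a name from the PR.
struct ConfidenceConfig {
  float temperature_scale = 2.0f;  // default agreed on in this thread

  // A non-positive temperature would make the scaling meaningless.
  bool Validate() const { return temperature_scale > 0.0f; }
};

// Scales one logit for confidence estimation only (assumes division by T).
inline float ScaleLogitForConfidence(float logit,
                                     const ConfidenceConfig &config) {
  assert(config.Validate());
  return logit / config.temperature_scale;
}
```

Setting `temperature_scale` to 1.0 would reproduce the confidences from before this change.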
There seems to be some problem with the workflow tests: many fail with a 503 error pointing to a huggingface URL. Otherwise, it should be ready for a code review (tested with a local client, and it works as expected).
Branch updated from ce6e5b5 to 8b57f73.
I fixed an error in the android build.
The tests look OK, seeing only unrelated errors.
Thank you for your contribution!
* Adding temperature scaling on Joiner logits
* making `temperature_scale` configurable from outside
T is configurable
so far best result NCE 0.122 (still not so high)
without temperature scaling (i.e. scale 1.0), the best NCE was 0.032 (here product merging was the best)
Results seem consistent with: https://arxiv.org/abs/2110.15222
Everything tuned on a very small set of 100 sentences with 813 words and 10.2% WER, a Czech model.
I also experimented with blank posteriors mixed into the BPE confidences, but no NCE improvement found, so not pushing that.
Temperature scaling was also added to the Greedy search confidences.
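For readers unfamiliar with the metric, here is a sketch of how NCE could be computed, assuming it denotes the normalized cross entropy commonly used to evaluate word confidences. The function name is hypothetical, and this is not the script used to obtain the numbers above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// NCE = (H_max - H_conf) / H_max, where H_max is the entropy of tagging every
// word with the average correctness rate, and H_conf is the cross entropy of
// the per-word confidences against the correct/incorrect labels.
double NormalizedCrossEntropy(const std::vector<double> &confidence,
                              const std::vector<bool> &correct) {
  assert(!confidence.empty() && confidence.size() == correct.size());
  const double eps = 1e-12;  // keeps log2() finite for confidences of 0 or 1
  const double n = static_cast<double>(confidence.size());

  double n_correct = 0.0;
  double h_conf = 0.0;
  for (std::size_t i = 0; i < confidence.size(); ++i) {
    const double p = std::min(std::max(confidence[i], eps), 1.0 - eps);
    if (correct[i]) {
      n_correct += 1.0;
      h_conf -= std::log2(p);
    } else {
      h_conf -= std::log2(1.0 - p);
    }
  }

  const double p_c = n_correct / n;
  assert(p_c > 0.0 && p_c < 1.0);  // undefined if all words are (in)correct
  const double h_max = -(n_correct * std::log2(p_c) +
                         (n - n_correct) * std::log2(1.0 - p_c));
  return (h_max - h_conf) / h_max;
}
```

Under this definition, a value of 0 means the confidences are no more informative than always outputting the overall word accuracy, 1 is the ideal, and negative values are possible for poorly calibrated scores.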